Duke University
I want to take a moment to honor the land in Durham, NC. Duke University sits on the ancestral lands of the Shakori, Eno, and Catawba people. This institution of higher education is built on land stolen from those peoples. These tribes were here before the colonizers arrived. Additionally, this land has borne witness to over 400 years of the enslavement, torture, and systematic mistreatment of African people and their descendants. Recognizing this history is an honest attempt to break out beyond persistent patterns of colonization and to rewrite the erasure of Indigenous and Black peoples. There is value in acknowledging the history of our occupied spaces and places. I hope we can glimpse an understanding of these histories by recognizing the origins of collective journeys.
library(rvest): package for harvesting websites/HTML
purrr::map()
Point out useful documentation & resources
This is a demonstration of leveraging the Tidyverse. This is not a research design or HTML design class. YMMV: data gathering and cleaning are vital and can be complex.
robots.txt | https://www.robotstxt.org
Step one: Gathering
Ingest web page data for analysis
rvest::read_html()
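A minimal sketch of the gathering step, assuming the rvest package is installed. To keep it runnable offline, a short HTML string (hypothetical content) stands in for a live page; in practice a URL would be passed to read_html() instead.

```r
library(rvest)

# Hypothetical page content. In real use, pass a URL string instead,
# e.g. read_html("https://example.com")
page_source <- "<html><head><title>Demo page</title></head>
<body><h1>Hello</h1><p>First paragraph.</p></body></html>"

# read_html() parses the markup into an xml_document object
page <- read_html(page_source)

# Pull one value out of the parsed document to confirm the ingest worked
page_title <- page %>%
  html_node("title") %>%
  html_text()

print(page_title)
```

The parsed object, not the raw text, is what the later crawling and parsing steps operate on.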
Step two: Crawling
Systematically iterating through a website, gathering data from more than one page (URL)
purrr::map()
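A rough sketch of the crawling step, assuming rvest and purrr are installed. The three HTML strings are hypothetical stand-ins for three pages of one site, so the example runs offline; with a live site, `pages` would be a character vector of URLs.

```r
library(rvest)
library(purrr)

# Hypothetical stand-ins for three pages of the same site.
# In real use this would be a vector of URLs.
pages <- c(
  page1 = "<html><body><h1>Page one</h1></body></html>",
  page2 = "<html><body><h1>Page two</h1></body></html>",
  page3 = "<html><body><h1>Page three</h1></body></html>"
)

# map() applies read_html() to each element and returns a list of documents
docs <- map(pages, read_html)

# map_chr() does the same iteration but returns a character vector,
# here the text of the first <h1> on each page
headings <- map_chr(docs, ~ html_text(html_node(.x, "h1")))

print(headings)
```

When iterating over a live site, it is usually polite to pause between requests, for example with Sys.sleep() inside the mapped function.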
Step three: Parsing
Separating the syntactic elements of a web page into meaningful data
rvest::html_nodes()
rvest::html_text()
rvest::html_attr()
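The three parsing functions above might be combined as in the sketch below, assuming rvest is installed. The link list is hypothetical markup invented for illustration: html_nodes() selects elements, html_text() extracts their text, and html_attr() extracts one attribute.

```r
library(rvest)

# Hypothetical markup: a short list of links
page <- read_html('<html><body>
  <ul>
    <li><a href="/poems/1.html">First poem</a></li>
    <li><a href="/poems/2.html">Second poem</a></li>
  </ul>
</body></html>')

# Select every <a> element nested inside an <li>
links <- html_nodes(page, "li a")

# The visible text of each link
link_text <- html_text(links)

# The value of each link's href attribute
link_href <- html_attr(links, "href")

# Assemble the pieces into a data frame for analysis
result <- data.frame(text = link_text, href = link_href)
print(result)
```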
HyperText Markup Language
Cascading Style Sheets
<html>
<body>
<div class="abc"> ... </div>
<div id="xyz">
<span class="foo"> ... </span>
</div>
<span id="bar"> ... </span>
</body>
</html>
for example: https://www.vondel.humanities.uva.nl/style.css
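The class and id attributes in the markup above are what CSS selectors target, and rvest accepts the same selectors. A small sketch, assuming rvest is installed; the text inside each element is a hypothetical placeholder for the elided content.

```r
library(rvest)

# Same structure as the markup above, with placeholder text (hypothetical)
page <- read_html('<html>
<body>
<div class="abc">first div</div>
<div id="xyz">
<span class="foo">inner span</span>
</div>
<span id="bar">outer span</span>
</body>
</html>')

# "." selects by class: the <div> with class "abc"
by_class <- page %>% html_nodes("div.abc") %>% html_text()

# "#" selects by id: the element with id "bar"
by_id <- page %>% html_nodes("#bar") %>% html_text()

# A space selects descendants: span.foo anywhere inside div#xyz
nested <- page %>% html_nodes("div#xyz span.foo") %>% html_text()

# A bare tag name selects every element of that type;
# html_attr() returns NA where the attribute is missing
span_ids <- page %>% html_nodes("span") %>% html_attr("id")
```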
The basic workflow of web scraping is
Development
Production